

香港中文大學

The Chinese University of Hong Kong

#### CSCI5550 Advanced File and Storage Systems Lecture 06: Flash Memory

#### Ming-Chang YANG

mcyang@cse.cuhk.edu.hk

#### Outline




- Flash Memory: Why and How
  - NAND Flash Technology
  - Inherent Challenges
- System Architecture
- Flash Translation Layer
  - Address Mapping
  - Garbage Collection
  - Wear Leveling
  - Multilevel I/O Parallelism
- Flash-aware File System
  - Flash-Friendly File System (F2FS)



## Why Flash Memory



- Flash memory is a widely used memory/storage technology in today's products.
  - It is a type of non-volatile memory.
    - Data persists even when power is lost.
  - **Revenue Growth**: 40% every year!
  - Volume Price: ~40% annual reduction!



# Solid-State Drive vs. Hard Disk Drive



#### Solid-State Drive (SSD)

- ✓ Faster performance
- ✓ No vibrations or noise
- ✓ Shock resistance
- ✓ More energy efficient
- ✓ Lighter and smaller

#### Hard Disk Drive (HDD)

- ✓ Cheaper per GB
- ✓ Reliability

# **The Greatest Invention of the 1990s**

- Invented by Dr. Fujio Masuoka (舛岡 富士雄), born in 1943, while working for Toshiba around 1980.
- First presented at the *IEEE International Electron Devices Meeting*, 1984.
- NAND flash was first introduced as SmartMedia storage in 1995.
- **NOR flash** was first commercialized by Intel in 1988 to replace read-only memory (ROM) for storing **BIOS** and **firmware**.







#### NAND Flash vs. NOR Flash





https://user.eng.umd.edu/~blj/CS-590.26/nand-presentation-2010.pdf

# NAND Flash Technology





# Single-Level Cell & Multi-Level Cell (1/2)

- Single-Level Cell (SLC): <u>one bit per cell</u>
  - SLC provides faster read/write speed, a lower error rate, and longer endurance, at a higher cost.
- Multi-Level Cell (MLC): *multiple bits per cell*
  - MLC allows each memory cell to store multiple bits of information, at the price of degraded performance and reliability.
    - Multi-Level Cell (MLC<sub>x2</sub>): 2 bits per cell
    - Triple-Level Cell (TLC): 3 bits per cell
    - Quad-Level Cell (QLC): 4 bits per cell
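
As a small worked example of why more bits per cell hurt reliability: storing n bits in one cell means the cell must distinguish 2^n threshold-voltage levels, so the margin between adjacent levels shrinks. The sketch below only does this arithmetic; the dictionary name is illustrative.

```python
# Bits stored per cell for each cell type listed above.
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def voltage_levels(bits_per_cell):
    """Number of distinct charge levels a cell must distinguish."""
    return 2 ** bits_per_cell


levels = {name: voltage_levels(bits) for name, bits in BITS_PER_CELL.items()}
for name, count in levels.items():
    print(f"{name}: {count} levels")
```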




## Single-Level Cell & Multi-Level Cell (2/2)



- Low efficiency in programming/verifying low bit(s)
- High bit error rate in low bit(s)

# **Evolution of NAND Flash**

- Scaling down the **floating-gate** cell is challenging.
  - The oxide thickness must be more than 6 nm.
- Charge trap flash (CTF) has become popular.
  - It uses a silicon nitride film to trap electrons.
- 3D flash continues the scaling trend and increases areal density by "building tall", i.e., stacking cell layers vertically.







## **Inherent Challenges**



- Common challenges of NAND flash:
  - ① Asymmetric operation units between read/write and erase
  - ② Erase before writing (a.k.a., write-once property)
  - ③ Limited endurance
  - ④ Data errors caused by write and read disturb
  - ⑤ Data retention errors
- Sophisticated management techniques are needed to hide these challenges and make flash usable as storage.

# **① Asymmetric Operation Units**




- Flash cells are organized into pages, and hundreds of pages are grouped into a block.
  - A page is further divided into data area and spare area.
    - The spare area keeps redundancy for error correction or metadata.
- Asymmetric Operation Units: Flash cells can only be read or written in units of a page, whereas erasure works on a whole block: all pages of a block must be erased together.



## **②** Erase before Write (Write-once)



- Erase before Write: Once written to "0", the only way to reset a flash cell to "1" is by erasing.
  - The erase operation sets all bits in a **block** to "1"s at a time.



- Write-once Property: A flash page <u>cannot</u> be overwritten until the residing **block** is erased first.
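
The first two challenges can be sketched as a toy model of a single block. This is a minimal illustration, not a device model: the class name `FlashBlock` and the tiny sizes are assumptions made for the example. Pages are programmed individually, a programmed page refuses a second write, and only a whole-block erase returns every page to all "1"s.

```python
class FlashBlock:
    """Toy model of one NAND flash block (sizes are illustrative)."""

    def __init__(self, pages_per_block=4, page_size=2):
        self.page_size = page_size
        self.pages_per_block = pages_per_block
        self.erase_count = 0
        self._reset()

    def _reset(self):
        # An erased cell reads as "1", so an erased page is all 0xFF bytes.
        self.pages = [b"\xff" * self.page_size for _ in range(self.pages_per_block)]
        self.written = [False] * self.pages_per_block

    def read(self, page):
        """Read is page-granular."""
        return self.pages[page]

    def program(self, page, data):
        """Write is page-granular and allowed only once per erase."""
        if len(data) != self.page_size:
            raise ValueError("data must fill exactly one page")
        if self.written[page]:
            raise RuntimeError("page already written; erase the block first")
        self.pages[page] = bytes(data)
        self.written[page] = True

    def erase(self):
        """Erase is block-granular: every page is reset at once."""
        self._reset()
        self.erase_count += 1
```

Programming page 0 twice without an intervening `erase()` raises `RuntimeError`, which is exactly the situation that forces out-of-place updates.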

## **③ Limited Endurance**

- Limited Endurance: A flash block can only endure a limited number of program/erase (P/E) cycles.
  - SLC:  $60K \sim 100K$  P/E cycles
  - MLC:  $1K \sim 10K$  P/E cycles
  - TLC: < 1K P/E cycles
- Reason: The oxide layer may be "worn out".



# ④ Read/Write Disturb

- Read/Write Disturb: Reading or writing a page may cause "weak programming" of its neighbors.
  - Solution to Write Disturb: Program the pages of a block "in a sequential order" (a.k.a., sequential write constraint).



## **⑤ Data Retention Errors**



- Data retention time defines how long the written data remains valid within a flash cell.
  - It is inversely related to the number of P/E cycles.
- Electrons leak over time and result in retention errors.
  - Solution: "Correct-and-refresh pages" from time to time.
    - This solution can also clear the errors accumulated from read/write disturb.



Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime (ICCD'12)




## **System Architecture**

- There are two typical ways to address the inherent challenges of flash memory:
  - ① Implementing a Flash Translation Layer in the device.
  - ② Designing a Flash-aware File System in the host.

Figure: the two resulting storage stacks.

- ① Application → Legacy File System (e.g., Ext2, FAT, LFS) → Flash Translation Layer → NAND Flash Memory
- ② Application → Flash-aware File System (e.g., JFFS, YAFFS, F2FS) → NAND Flash Memory




# **Flash Translation Layer**

- The Flash Translation Layer (FTL) is firmware inside the flash device that makes the NAND flash memory "appear as" a block device to the host.
  - It consists of three major components: ① address translator, ② garbage collector, and ③ wear leveler.



# **① Address Translator**



- Due to the write-once property, out-of-place update is adopted: updated data is written to free pages.
  - The address translator maps logical block addresses (LBAs) from the host to physical page addresses (PPAs) in flash.
    - The mapping table is kept in the **memory space** of the flash device.
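
A minimal sketch of out-of-place update with a page-level mapping table follows. The class `PageLevelFTL` and its fields are hypothetical names for illustration; free pages are handed out by a simple counter, and garbage collection is deliberately left out.

```python
class PageLevelFTL:
    """Toy address translator: LBA -> PPA, with out-of-place updates."""

    def __init__(self, num_pages):
        self.mapping = {}                  # LBA -> PPA (kept in device RAM)
        self.flash = [None] * num_pages    # contents of each physical page
        self.valid = [False] * num_pages   # does the page hold the latest version?
        self.next_free = 0                 # next never-written physical page

    def write(self, lba, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; garbage collection needed")
        old_ppa = self.mapping.get(lba)
        if old_ppa is not None:
            self.valid[old_ppa] = False    # the old version becomes stale
        ppa = self.next_free
        self.next_free += 1
        self.flash[ppa] = data             # write to a free page, never in place
        self.valid[ppa] = True
        self.mapping[lba] = ppa            # redirect the LBA to the new page
        return ppa

    def read(self, lba):
        return self.flash[self.mapping[lba]]
```

Writing the same LBA twice consumes two physical pages; the first one still holds the old data but is marked invalid, which is the space the garbage collector later reclaims.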



# Page-Level, Block-Level, and Hybrid



#### Page-Level (PL)

#### Block-Level (BL)

| LBA  | PBA   |
|------|-------|
| 0~3  | X → Y |
| 4~7  |       |
| 8~11 |       |

#### Hybrid: Data Block(s) share Log Block(s)
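
The trade-off between the two granularities can be sketched with a little arithmetic. The example assumes 4 pages per block, which matches the 0~3 / 4~7 / 8~11 ranges in the table; the block names in `block_table` are made up for illustration.

```python
PAGES_PER_BLOCK = 4  # assumption: matches the LBA ranges in the table above


def block_level_translate(lba, block_table):
    """Block-level mapping: look up the block, keep the in-block page offset."""
    logical_block, offset = divmod(lba, PAGES_PER_BLOCK)
    return block_table[logical_block], offset


def table_entries(num_pages, page_level):
    """Mapping-table entries needed: one per page, or one per block."""
    return num_pages if page_level else num_pages // PAGES_PER_BLOCK


block_table = {0: "Y", 1: "Z", 2: "W"}  # logical block -> physical block
print(block_level_translate(6, block_table))
print(table_entries(1024, page_level=True), table_entries(1024, page_level=False))
```

The block-level table is smaller by a factor of `PAGES_PER_BLOCK`, but because the page offset inside the block is fixed, updating a single page means relocating the whole block.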

# Hybrid FTL



- Hybrid FTLs logically partition blocks into two groups:
  - Data Blocks are mapped via a block-level mapping.
  - Log/Update Blocks are mapped via a page-level mapping.
    - Any update to a data block is performed by writing to the log blocks.
    - A few log blocks are shared by all data blocks.



#### **Expensive Merge of Hybrid FTL**



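Hybrid FTLs induce costly garbage collection: reclaiming a log block requires merging its valid pages with those of the associated data block(s) into a free block, and then erasing the old blocks.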



## **Demand-based Address Translation**



- Keeping the entire page-level mapping table in RAM is inefficient.
- Goal: page-level translation for all data with limited RAM.
  - The mappings of data pages are stored in translation pages on flash.
  - Translation pages are cached in RAM on demand.
  - The map of translation pages (i.e., the global translation table) is kept in **RAM** for efficient lookup.



# **DFTL: A Working Example**





Figure labels: Data Block, Translation Block.

- (1) A request to DLPN 1280 incurs a miss in the Cached Mapping Table (CMT).
- (2) Victim entry DLPN 1 is selected; its corresponding translation page MPPN 21 is located using the Global Translation Directory (GTD).
- (3)-(4) MPPN 21 is read, updated (DPPN 130 → DPPN 260), and written to a free translation page (MPPN 23).
- (5)-(6) The GTD is updated (MPPN 21 → MPPN 23) and the DLPN 1 entry is erased from the CMT.
- (7)-(11) The translation page of the original request (DLPN 1280) is located on flash (MPPN 15). The mapping entry is loaded into the CMT and the request is serviced.

Note that each GTD entry maps 512 logically consecutive mappings.

DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings (ASPLOS'09)
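
The steps above can be replayed with a small sketch of demand-based caching. It is a simplification under stated assumptions, not the DFTL implementation: the class `DemandFTL` and its field names are hypothetical, the CMT uses plain LRU replacement, translation pages are modeled as dictionaries, and the DPPN value 660 for DLPN 1280 is made up for the example.

```python
from collections import OrderedDict

ENTRIES_PER_TPAGE = 512  # each GTD entry covers 512 consecutive mappings


class DemandFTL:
    def __init__(self, cmt_capacity, first_free_mppn=0):
        self.cmt = OrderedDict()   # cached mapping table: DLPN -> DPPN (LRU order)
        self.cmt_capacity = cmt_capacity
        self.dirty = set()         # cached entries newer than their flash copy
        self.gtd = {}              # translation-page index -> MPPN (always in RAM)
        self.flash_tpages = {}     # MPPN -> {DLPN: DPPN} (translation pages on flash)
        self.next_mppn = first_free_mppn

    def _evict(self):
        dlpn, dppn = self.cmt.popitem(last=False)       # least recently used entry
        if dlpn not in self.dirty:
            return
        self.dirty.discard(dlpn)
        index = dlpn // ENTRIES_PER_TPAGE
        page = dict(self.flash_tpages.get(self.gtd.get(index), {}))  # read
        page[dlpn] = dppn                                            # update
        self.flash_tpages[self.next_mppn] = page   # write to a free translation page
        self.gtd[index] = self.next_mppn           # old page stays behind as stale
        self.next_mppn += 1

    def update(self, dlpn, dppn):
        """Record a new mapping after a data page was written out of place."""
        if dlpn not in self.cmt and len(self.cmt) >= self.cmt_capacity:
            self._evict()
        self.cmt[dlpn] = dppn
        self.cmt.move_to_end(dlpn)
        self.dirty.add(dlpn)

    def lookup(self, dlpn):
        if dlpn in self.cmt:                       # CMT hit
            self.cmt.move_to_end(dlpn)
            return self.cmt[dlpn]
        if len(self.cmt) >= self.cmt_capacity:     # CMT miss: make room first
            self._evict()
        mppn = self.gtd[dlpn // ENTRIES_PER_TPAGE]
        dppn = self.flash_tpages[mppn][dlpn]       # fetch the mapping from flash
        self.cmt[dlpn] = dppn
        return dppn


# Replay of the example: DLPN 1 is cached and dirty, DLPN 1280 misses.
ftl = DemandFTL(cmt_capacity=1, first_free_mppn=23)
ftl.gtd = {0: 21, 2: 15}
ftl.flash_tpages = {21: {1: 130}, 15: {1280: 660}}
ftl.update(1, 260)
result = ftl.lookup(1280)
```

After the lookup, `ftl.gtd[0]` points to MPPN 23, whose copy of the translation page maps DLPN 1 to DPPN 260, and the CMT holds only the entry for DLPN 1280.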

# **②** Garbage Collector



- Since out-of-place updates leave multiple versions of data, the garbage collector reclaims pages occupied by stale data by erasing the blocks they reside in.
  - Live-page copying is needed to migrate the pages holding the latest versions of data <u>before</u> the erase operation.



# **Typical GC Procedure**



- ① Victim Block Selection: Pick one (or more) block that is "most worthy" to be reclaimed
- ② Live-Page Copying: Migrate all live pages out
- ③ Victim Block Erasing: Reclaim the space occupied by dead pages



# **Typical Victim Block Selection Policies**

- Random selects the victim block <u>uniformly at random</u> to even out block usage in the long run.
- FIFO (or RR) cleans blocks in a <u>round-robin manner</u> to minimize the difference in P/E cycles.
- Greedy cleans the block with the largest number of dead pages to minimize live-page copying.
- Cost-Benefit (CB) (and its variants) cleans the block with the largest benefit-to-cost ratio, which also helps wear leveling:

$$\frac{\text{benefit}}{\text{cost}} = \frac{\text{age} \times (1-u)}{2u}$$

- age: invalidated time period = (current time) − (time when a page of the block was last invalidated)
- *u*: percentage of live pages of the block, expressed as a fraction between 0 and 1
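
To see how Greedy and Cost-Benefit can disagree, here is a small sketch of both policies. The `Block` record and the sample numbers are hypothetical; the score is the benefit/cost formula given above, with a block that has no live pages treated as infinitely attractive (an assumption to avoid dividing by zero).

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    live_pages: int
    dead_pages: int
    last_invalidated: float  # time when a page of this block was last invalidated

    @property
    def u(self):
        """Fraction of live pages in the block."""
        return self.live_pages / (self.live_pages + self.dead_pages)


def greedy(blocks):
    """Pick the block with the most dead pages."""
    return max(blocks, key=lambda b: b.dead_pages)


def cost_benefit(blocks, now):
    """Pick the block with the largest age * (1 - u) / (2u)."""
    def score(b):
        if b.u == 0:
            return float("inf")  # nothing to copy: reclaiming is free
        age = now - b.last_invalidated
        return age * (1 - b.u) / (2 * b.u)
    return max(blocks, key=score)


blocks = [
    Block("A", live_pages=1, dead_pages=3, last_invalidated=95),  # hot: just invalidated
    Block("B", live_pages=2, dead_pages=2, last_invalidated=10),  # cold: stable for long
]
print(greedy(blocks).name, cost_benefit(blocks, now=100).name)
```

Greedy picks A because it has more dead pages (score 7.5 under CB), while Cost-Benefit picks B (score 45) because its remaining live pages have been stable for a long time and are unlikely to be invalidated soon.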

# Page Liveness

- Store the LBA of each page in its spare area.
- Check both the spare area of pages and the mapping table during GC.
  - Live: LBA matches the mapping.
  - Dead: LBA mismatches the mapping.

| LBA | PPA    |
|-----|--------|
| 0   | (O, 3) |
| 1   | (Q, 0) |
| 2   | (Q, 1) |
| 3   | (Q, 3) |
|     |        |
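
The liveness check and the three GC steps can be combined in a short sketch. The data layout is an assumption for illustration: a PPA is a `(block, page)` tuple, `flash` maps each written PPA to `(LBA from the spare area, data)`, and `free_ppas` is a list of free pages outside the victim block. The sample contents are made up.

```python
def is_live(ppa, spare_lba, mapping):
    """A page is live only if the mapping table still points its LBA at this PPA."""
    return mapping.get(spare_lba) == ppa


def garbage_collect(victim, flash, mapping, free_ppas):
    """Copy live pages out of the victim block, then erase it. Returns copies made."""
    copied = 0
    for ppa in sorted(p for p in flash if p[0] == victim):
        lba, data = flash[ppa]
        if is_live(ppa, lba, mapping):
            new_ppa = free_ppas.pop(0)
            flash[new_ppa] = (lba, data)   # live-page copying
            mapping[lba] = new_ppa
            copied += 1
    for ppa in [p for p in flash if p[0] == victim]:
        del flash[ppa]                     # victim block erasing
    return copied


mapping = {0: ("P", 3), 1: ("Q", 0), 2: ("Q", 1)}
flash = {
    ("P", 0): (0, "old a"), ("P", 1): (1, "old b"),
    ("P", 2): (2, "old c"), ("P", 3): (0, "a"),
    ("Q", 0): (1, "b"), ("Q", 1): (2, "c"),
}
copies = garbage_collect("P", flash, mapping, free_ppas=[("R", 0), ("R", 1)])
```

Only page `("P", 3)` is live, so a single copy is made before block P is erased; the three stale pages are reclaimed without any copying.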



#### **③ Wear Leveler**



- Since each block can only endure a limited number of P/E cycles, the wear leveler prolongs the overall lifetime by distributing erases evenly.
  - The main objective is to prevent any flash block from being worn out "prematurely", i.e., much earlier than the others.



# **Dynamic vs. Static Wear Leveling**



- Wear leveling (WL) is classified as dynamic or static based on the type of blocks involved:
  - Dynamic Wear Leveling
    - Let free blocks have <u>closer</u> erase counts.
    - How? Simply use the <u>free block</u> with the lowest erase count (i.e., a young block) to service writes.
  - Static Wear Leveling
    - Let all blocks (free + used) have <u>closer</u> erase counts.
    - Actively move cold data from a young block to an older block.
    - Then use this (previously used) young block to service writes.

# **Dynamic Wear Leveling (DWL)**



- DWL achieves wear leveling only for hot blocks.
  - Hot Block: a block mainly containing hot data
  - Cold Block: a block mainly containing cold data
- Hot blocks will be worn out earlier than cold blocks.



# Static Wear Leveling (SWL)



- SWL achieves wear leveling for all blocks by proactively moving cold data out of young blocks (and into old blocks).
  - Extra data migrations are introduced.
  - Performance is traded for lifetime/endurance.
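
The difference between the two schemes can be sketched as two small functions. This is an illustration under assumptions, not any published algorithm: erase counts live in a dictionary, the youngest used block is assumed to hold cold data (it is rarely erased), and the trigger `threshold` is a hypothetical parameter.

```python
def pick_free_block(free_blocks, erase_count):
    """Dynamic WL: service the next write with the youngest free block."""
    return min(free_blocks, key=lambda b: erase_count[b])


def static_wear_level(free_blocks, used_blocks, erase_count, threshold):
    """Static WL: move cold data from the youngest used block to the oldest free block.

    Returns (young, old) if a migration happened, else None.
    """
    if not free_blocks or not used_blocks:
        return None
    young = min(used_blocks, key=lambda b: erase_count[b])  # likely holds cold data
    old = max(free_blocks, key=lambda b: erase_count[b])
    if erase_count[old] - erase_count[young] < threshold:
        return None                   # wear is even enough; skip the extra copying
    # The cold data now lives in the old block ...
    free_blocks.remove(old)
    used_blocks.add(old)
    # ... and the young block is erased and becomes available for writes.
    used_blocks.remove(young)
    erase_count[young] += 1
    free_blocks.add(young)
    return young, old


erase_count = {"A": 100, "B": 98, "C": 2, "D": 50}
free_blocks, used_blocks = {"A", "B"}, {"C", "D"}
before = pick_free_block(free_blocks, erase_count)
moved = static_wear_level(free_blocks, used_blocks, erase_count, threshold=50)
after = pick_free_block(free_blocks, erase_count)
```

With dynamic wear leveling alone the best available choice is block B (98 erases), since the young block C is pinned by cold data. After the static migration, C joins the free pool and services the next write, at the cost of one extra block copy and erase.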



Improving Flash Wear-Leveling by Proactively Moving Static Data (TC'10)

Rejuvenator: A static wear leveling algorithm for NAND flash memory with minimized overhead (MSST'11)

# **Progressive Wear Leveling (PWL)**



- When, and how often, should WL be performed?
- WL should be performed in a **progressive way**:
  - Prevent WL in the early stages for better performance.
  - Progressively trigger WL to prolong lifetime.
    - More and more WLs are performed over time.



#### **Error-Rate-Aware Wear Leveling**



- Key Observations: Flash is not perfect!
  - P/E cycles might inaccurately reflect the reliability of flash.
  - Blocks might have different bit-error rates (BERs) when enduring the same P/E cycles due to process variation.
- Idea: BER could be a better metric for WL designs.
  - The error correction hardware can report BER to FTL.



New ERA: New Efficient Reliability-Aware Wear Leveling for Endurance Enhancement of Flash Storage Devices (DAC'13)

# Multilevel I/O Parallelism (1/2)



- The internal organization of flash devices is highly hierarchical:
   Channel → Chip → Die → Plane → Block → Page
- Multiple I/O operations can be performed concurrently.



# Multilevel I/O Parallelism (2/2)



- The optimal priority order of parallelism should be:
  - ① Channel-level
  - ② Die-level
  - ③ Plane-level
  - ④ Chip-level

| SSD  | Channel-Chip-Die-Plane | A   | Page | Priority order         |
|------|---------|-----|------|------------------------|
| SSD1 | 8-4-2-2 | Yes | 2KB  | chip>die>plane>channel |
| SSD2 | 8-4-2-2 | Yes | 2KB  | channel>chip>die>plane |
| SSD3 | 1-4-2-2 | Yes | 2KB  | channel>chip>die>plane |
| SSD4 | 1-4-2-2 | Yes | 2KB  | channel>die>chip>plane |
| SSD5 | 1-4-2-2 | Yes | 2KB  | channel>plane>die>chip |
| SSD6 | 1-4-2-2 | Yes | 2KB  | channel>die>plane>chip |
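
One way to read a priority order is as a striping rule: consecutive pages of a write stream are spread first across the highest-priority level, then the next, and so on. The sketch below encodes that reading as a mixed-radix decomposition. This interpretation, the function name `place`, and the use of the 8-4-2-2 geometry are assumptions for illustration, not the allocation scheme of any particular SSD.

```python
# Geometry 8-4-2-2: channels, chips per channel, dies per chip, planes per die.
GEOMETRY = {"channel": 8, "chip": 4, "die": 2, "plane": 2}


def place(write_index, priority=("channel", "die", "plane", "chip")):
    """Map the i-th page of a write stream to (channel, chip, die, plane).

    The first level in `priority` varies fastest, so it is exploited first.
    """
    coords = {}
    rest = write_index
    for level in priority:
        rest, coords[level] = divmod(rest, GEOMETRY[level])
    return tuple(coords[level] for level in ("channel", "chip", "die", "plane"))


for i in (0, 1, 8, 16, 32):
    print(i, place(i))
```

With the order channel > die > plane > chip, pages 0 to 7 land on eight different channels, page 8 moves to the second die, page 16 to the second plane, and only page 32 moves to the next chip.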



### **Flash Management is Complex**



- The controller of a flash memory device is complex.
  - It must perform a myriad of tasks to receive, monitor and deliver data efficiently and reliably.






## **Recall: System Architecture**



- There are two typical ways to address the inherent challenges of flash memory:
  - ① Implementing a Flash Translation Layer in the device.
  - ② Designing a Flash-aware File System in the host.


## Flash-aware File System

- Random writes are bad for flash devices.
  - Free space fragmentation
  - Degraded performance (due to GC)
  - Reduced lifetime (due to GC)
- Writes must be reshaped into sequential writes.
  - This is the same idea as the log-structured file system (LFS) for HDDs!
- Most flash-aware file systems are derived from LFS:
  - Journaling Flash File System (JFFS)
  - Yet Another Flash File System (YAFFS)
  - Flash-Friendly File System (F2FS)
    - Publicly available, included in Linux mainline kernel since Linux 3.8.

# **Recall: Log-structured File System**



- LFS first buffers all writes in an in-memory **segment** and commits the segment to disk **sequentially**.
  - The Inode Map (imap)
    - Maps from an inode-number to the disk-address of the *most* <u>recent version</u> of the inode (i.e., one more mapping!).
    - Updated whenever an inode is written to disk.
    - Placed right next to where data block (D) and inode (I[k]) reside.
  - The Checkpoint Region (CR):
    - Records disk pointers to all latest pieces of imap.
    - Flushed to disk periodically (e.g., every 30 seconds).



# Flash-friendly On-Disk Layout



- **Key:** There is no re-position delay in flash memory!
- Flash Awareness
  - All the **FS metadata** are located together for locality.
  - The start address of the main area is aligned to the zone size.
    - block = 4KB; segment = 2MB; section = n segments; zone = m sections.
  - File system cleaning (i.e., GC) is done in units of a section.
- Cleaning Cost Reduction
  - Multi-head logging for hot/cold data separation.



## **LFS Index Structure**



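- LFS manages the disk space as one big log.
- LFS has the update propagation problem: because blocks are never overwritten in place, relocating a data block changes its address, so the pointer blocks and the inode that reference it must be rewritten as well.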




## **F2FS Index Structure**

- F2FS restrains update propagation with the *node address table* (NAT).
- F2FS manages the flash space as a *multi-head log*.
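
A rough sketch of why the node address table restrains propagation: in the LFS-style update every block on the path embeds its child's physical address, so all of them are rewritten, whereas in the F2FS-style update parents refer to a node by ID and only the NAT entry changes. The log is modeled as a Python list and all names are illustrative.

```python
def update_lfs(log, path):
    """LFS-style: rewrite every block from the data block up to the inode.

    `path` lists block names bottom-up; each rewritten block embeds the new
    address of its child. Returns the new inode address (the imap must follow).
    """
    child_addr = None
    for name in path:
        log.append((name, child_addr))
        child_addr = len(log) - 1
    return child_addr


def update_f2fs(log, nat, data_name, direct_node_id):
    """F2FS-style: rewrite the data block and its direct node, then fix the NAT."""
    log.append((data_name, None))
    data_addr = len(log) - 1
    log.append((direct_node_id, data_addr))
    nat[direct_node_id] = len(log) - 1   # parents find this node by ID via the NAT


lfs_log = []
update_lfs(lfs_log, ["data", "direct", "indirect", "inode"])

f2fs_log, nat = [], {}
update_f2fs(f2fs_log, nat, "data", "node#7")
print(len(lfs_log), len(f2fs_log))
```

For a data block reached through one indirect block, the LFS-style update appends four blocks to the log, while the F2FS-style update appends two and changes one NAT entry.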




# Multi-head Logging

- Data temperature:
  - Node > Data
  - Direct Node > Indirect Node
  - Directory > User File

| Type | Temp. | Objects                                                                                  |
|------|-------|------------------------------------------------------------------------------------------|
| Node | Hot   | Direct node blocks for directories                                                       |
| Node | Warm  | Direct node blocks for regular files                                                     |
| Node | Cold  | Indirect node blocks                                                                     |
| Data | Hot   | Directory entry blocks                                                                   |
| Data | Warm  | Data blocks made by users                                                                |
| Data | Cold  | Data blocks moved by cleaning; cold data blocks specified by users; multimedia file data |

- Separation of multi-head logs in NAND flash
  - Hot/cold separation reduces the cleaning (GC) overhead.
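
The classification in the table can be written as a small decision function that chooses one of the six logs for a block. The function name and its boolean flags are hypothetical; checking the cold-data conditions before the directory condition is an assumption of this sketch.

```python
def choose_log(kind, is_directory=False, is_indirect=False,
               moved_by_cleaning=False, marked_cold=False, is_multimedia=False):
    """Return (type, temperature) of the log a block should be appended to."""
    if kind == "node":
        if is_indirect:
            return ("node", "cold")
        return ("node", "hot") if is_directory else ("node", "warm")
    if kind == "data":
        if moved_by_cleaning or marked_cold or is_multimedia:
            return ("data", "cold")
        return ("data", "hot") if is_directory else ("data", "warm")
    raise ValueError("kind must be 'node' or 'data'")


print(choose_log("node", is_directory=True))
print(choose_log("data"))
print(choose_log("data", is_multimedia=True))
```

Because each of the six combinations is written to its own log, blocks with similar lifetimes end up in the same sections, so a section tends to become entirely invalid at once and is cheap to clean.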



# Crash Recovery

#### ① Checkpoint

- Maintains shadow copies of the checkpoint, NAT (node address table), and SIT (segment information table) blocks.
- On recovery, the latest checkpoint is restored.







# Summary



Figure: I/O stack. User space: Application. Kernel: File System → Block Layer → Device Driver. Below the kernel: I/O Device.

- Flash Memory: Why and How
  - NAND Flash Technology
  - Inherent Challenges
- System Architecture
- Flash Translation Layer
  - Address Mapping
  - Garbage Collection
  - Wear Leveling
  - Multilevel I/O Parallelism
- Flash-aware File System
  - Flash-Friendly File System (F2FS)